1. Field of the Invention
The present invention generally relates to resource classification. In particular, the present invention relates to techniques for automated entity-specific identification and classification of candidate resources.
2. Background
People often search for information on networks such as the Internet, and on the World Wide Web service available through it, by querying general search engines, such as Yahoo! Search, or localized search engines, such as Yahoo! Local. Queries often request information about particular entities, such as a particular school or business.
Real-world entities are often associated with computerized resources such as webpages accessible through the Internet. For a given entity, one or a few webpages identified by one or more uniform resource locators (URLs) are deemed authoritative or official. For example, an official home page for an entity may be a webpage created by a school, a manufacturer, or an individual such as a professor, athlete, student or actor.
Presently, search engines are generally unable to routinely identify official home pages (OHPs). Official home pages for entities are often not found or are buried in a long list of search results provided in response to user queries. There are several reasons for this. First, entities represented by webpages number in the hundreds of millions, which makes manual labeling of official home pages for those entities infeasible. Second, official home pages may not show up in top search results because they are often not optimized for search engines. Instead, third-party pages such as fan pages, aggregator sites, forums and other webpages are often presented to users as being more relevant to the entities searched for than the official home pages of those entities, because the third-party pages are often optimized for search engines.
Instead of manually performing multiple searches and sifting through results in search of an official home page, users would welcome automated identification of official home pages for the entities they are interested in. However, at least in this case, merely labeling webpages as official without identifying what they are official for is not helpful. To be helpful, it must be determined not only whether a webpage is an official home page, but also for what specific entity it is an official home page.